Introduction to the data-set

In this tutorial, we will analyze single-cell RNA sequencing data from Ho et. al, Genome Research, 2018. They are derived from 451Lu melanoma cell line in two different conditions: 1) parental (untreated) and 2) resistant (treated for 6 weeks with BRAF inhibitors and yet proliferating hence called resistant):

Schematic depicting the experimental protocol.
Schematic depicting the experimental protocol.


Importantly the differenceS between these two cell populations are 1) the exposure to treatment (none versus chronic); 2) the resistance that emerges from the treatment. Hence these two variables (treatment and resistance) are confounded by the experimental set-up. This tutorial is built on the Seurat package; you can refer to their vignettes for more details about the functions.

Data required for this tutorial can be dowloaded from Zenodo: https://zenodo.org/records/11088511

Alternative polyadenylation analysis

The analysis of alternative 3’ UTR usage is based on pipeline previously published (Andreassi et al, 2021) and further improved by Lisa Fournier (TA).

apa_data <- read.csv("./APA_table.csv")

#Data: 
# apa_matrix       : 11232 x 9 table; the columns are:
  # gene: gene name
  # proximal_id: the name of the proximal 3' UTR isoform of the pair (Ip)
  # proximal_length: length (nt) of the proximal 3' UTR isoform (Ip)
  # proximal_occurence_parental: parental pseudo-bulk count for the proximal isoform (sum over all parental cells)
  # proximal_occurence_resistant: resistant pseudo-bulk count for the proximal isoform (sum over all resistant cells)
  # distal_id: the name of the distal 3' UTR isoform of the pair (Id)
  # distal_length: length (nt) of the distal 3' UTR isoform (Id)
  # distal_occurence_parental: parental pseudo-bulk count for the distal isoform (sum over all parental cells)
  # distal_occurence_resistant :resistant pseudo-bulk count for the distal isoform (sum over all resistant cells)
mycols <- c("#6699FF", "#CC33CC")

Calculate PUD and RUD in each sample

We will first calculate the relative proximal usage as well as the log2FC between the distal and the proximal (RUD); both values are relevant when identifying shifts in 3’ UTR. These values are calculated for each sample. We can first calculate the PUD and RUD scores using the following formula: \(PUD=Ip/(Ip+Id)\) and \(RUD=log2(Ip/Id)\).
Examples of PUD with either PUD<0.5 (left; distal 3’ UTR predominantly expressed) or PUD>0.5 (right; proximal predominantly expressed).
Examples of PUD with either PUD<0.5 (left; distal 3’ UTR predominantly expressed) or PUD>0.5 (right; proximal predominantly expressed).
#Calculate PUD and RUD for each condition
apa_data <- apa_data %>%
    mutate(pud_parental = proximal_occurence_parental / (proximal_occurence_parental + distal_occurence_parental))
apa_data <- apa_data %>%
    mutate(pud_resistant = proximal_occurence_resistant / (proximal_occurence_resistant + distal_occurence_resistant))

apa_data <- apa_data %>%
    mutate(rud_parental = log2(proximal_occurence_parental / distal_occurence_parental))

apa_data <- apa_data %>%
    mutate(rud_resistant = log2(proximal_occurence_resistant / distal_occurence_resistant))

# Drop the rows with infinite RUD values for the plots
apa_data <- apa_data[is.finite(apa_data$rud_parental), ]
apa_data <- apa_data[is.finite(apa_data$rud_resistant), ]


# plot the PUD distributions
layout(matrix(ncol=3,nrow=2,c(1,1,3,2,2,4),byrow = TRUE))

multidensity(list(parental=apa_data$pud_parental,resistant=apa_data$pud_resistant),col=mycols, las=1, main="PUD", xlab="PUD")
grid()
multidensity(list(parental=apa_data$rud_parental,resistant=apa_data$rud_resistant),col=mycols, las=1, main="RUD", xlab="RUD")
grid()
boxplot(apa_data$pud_parental, apa_data$pud_resistant, col=mycols, names=c("parental", "resistant"),las=1,frame=FALSE)
result <- wilcox.test(apa_data$pud_parental, apa_data$pud_resistant)
title(paste("P=", format(result$p.value,scientific=TRUE,digit=2)))

boxplot(apa_data$rud_parental, apa_data$rud_resistant, col=mycols, names=c("parental", "resistant"), outline=FALSE,las=1,frame=FALSE)
result <- t.test(apa_data$rud_parental, apa_data$rud_resistant)
title(paste("P=", format(result$p.value,scientific=TRUE,digit=2)))

df <- data.frame(type=c(rep("parental",nrow(apa_data)),rep("resistant",nrow(apa_data))),
                 pud=c(apa_data$pud_parental,apa_data$pud_resistant),
                 rud=c(apa_data$rud_parental,apa_data$rud_resistant))

The PUD distributions are bimodal. Therefore, due to the non-normality of the data, to compare them we will use a non-parametric test: the Wilcoxon test. Concerning the RUD, the data are normal. We therefore use the Student t-test.

Differential APA analysis

We can now identify the genes exhibiting changes in 3’ UTR usage by comparing their PUD between resistant and parental. P-values are obtained using the Fisher count test on the raw read count and changes are identified when \(\Delta PUD >0.15\) or \(\Delta PUD <(-0.15)\) and \(P-value<0.01\).

#Calculate dPUD and dRUD

apa_data <- apa_data %>%
    mutate(dpud = pud_resistant - pud_parental )

apa_data <- apa_data %>%
    mutate(drud = rud_resistant - rud_parental )

#Counts to compute the fisher tests
df_for_fisher = round(apa_data[, c("proximal_occurence_parental", "proximal_occurence_resistant", "distal_occurence_parental", "distal_occurence_resistant")])
# Apply Fisher's Test to each row
results <- apply(df_for_fisher, 1, function(x) {
  # Create a matrix for each row
  mat <- matrix(as.numeric(x), nrow = 2)
  
  # Perform Fisher's Test
  fisher.test(mat)$p.value
})

apa_data$p_value <- results
apa_data$fdr     <- p.adjust(apa_data$p_value,method="fdr")

# Selection of differential 3' UTR usage
is_sig <- apa_data$fdr < 0.01
dist <- (apa_data$dpud <= (-0.15) & apa_data$drud <= (-1))
short <- (apa_data$dpud >= (0.15) & apa_data$drud >= 1)

sel_distal   <- is_sig & dist
sel_proximal <- is_sig & short
# Make a nice plot
coldiff <- c("grey",rgb(0,0,0,0.15))
colsA <- c("#81A4D6","#2D598E","#083872")
colsB <- c("#AE73B1","#79387C","#57055B")


plot(apa_data[,c("pud_parental","pud_resistant")],pch=19,col=rgb(0,0,0,0.1),cex=0.3,las=1,frame=FALSE,xlab="Ip/(Ip+Id) [parental]",ylab="Ip/(Ip+Id) [resistant]")
points(apa_data[sel_proximal,c("pud_parental","pud_resistant")],pch=19,col=colsA[1],cex=0.3)
text(x=0.6,y=1.0,labels=paste(sum(sel_proximal),"proximal-to-distal shifts in resistant"),cex=0.6,col=colsA)
points(apa_data[sel_distal,c("pud_parental","pud_resistant")],pch=19,col=colsB[1],cex=0.3)
text(x=0.6,y=0.0,labels=paste(sum(sel_distal),"distal-to-proximal shifts in resistant"),cex=0.6,col=colsB)

Cell clustering and cell type annotation

Load the normalised data

For details how to produce the data, please refer to the exercises of week 9. You are free the test the different normalisation methods which are all included in the R object.

load("./normalised_seurat_objects.RData")
#Seurat objects:
#GE : raw data
#GE_log : log-normalise
#GE_clr : CLR normalisation
#GE_rc  : RC normalisation
#GE_sct : SCTtransformed
#GE_qn  : quantile normalsied

Dimensionality reduction

Prior to unsupervised clustering, dimensionality reduction is often applied on the most informative genes/features. As performed in week 9, we can use different methods for this task.

Identification of highly variable genes (feature selection)

Selection of top 1000 variable genes.

layout(matrix(ncol=2,nrow=1,c(1:2),byrow = TRUE))
#Identify top 10'000 genes
GE_log <- FindVariableFeatures(GE_log, selection.method = "vst", nfeatures = 10000)

PCA, t-SNE and UMAP

Prior performing the PCA we first need to scale and center the genes.

layout(matrix(ncol=3,nrow=2,c(1:6),byrow = TRUE))
# Scale and Center gene features
GE_log    <- ScaleData(GE_log, features =rownames(GE_log))
#Run PCA
GE_log    <- RunPCA(GE_log, features = VariableFeatures(object = GE_log))
#Run t-SNE
GE_log    <- RunTSNE(GE_log, dims = 1:10,perplexity=10)
#Run UMAP
GE_log    <- RunUMAP(GE_log, dims = 1:10,n.neighbors=5,min.dist=0.4,metric="man")

PCA plot:

pt<-DimPlot(GE_log, reduction = "pca", group.by = "treatment", pt.size=1, cols = mycols,dims = c(1, 2),alpha=0.2) +ggtitle("PCA")
pt

t-SNE plot:

pt<-DimPlot(GE_log, reduction = "tsne", group.by = "treatment", pt.size=1, cols = mycols,dims = c(1, 2),alpha=0.2) +ggtitle("t-SNE")
pt

pt<-DimPlot(GE_log, reduction = "umap", group.by = "treatment", pt.size=1, dims = c(1, 2), cols = mycols,alpha=0.2)+ggtitle("UMAP")
pt

scTransform

This method transforms the counts using the Pearson residuals of a probabilistic model. However even scTransform leads to undesirable biases that can mask true signals and upweight technical artifacts. One of the most popular method for count-based modeling. The idea behind scTransform is to apply PCA to the matrix of Pearson residuals obtained by fitting negative binomial GLMs to the count matrix \(Y\). Specifically for each row \(i\) scTransform fits the following model to data \(Y_{i1},\cdot\cdot\cdot,Y_{iJ}\): \(Y_{ij}\sim NegBinom (\mu_{ij},\alpha_i)\) and \(\log \mu_{ij}=\beta_{0i}+\beta_{1i}\log(S_j)\).

scGBM

It is known that fundamental issues can arise from the commonly used approach of running PCA on \(\log(1+x)\) transformed scRNA-seq data. For instance, (Townes et al. (2019)) showed that the first principal component is strongly correlated with the number of zeros per cell, even on null data. Methods using count models have been developed to address these issues, however, we find that the current leading approaches still have significant limitations.

While most dimensionality reduction approaches apply a transformation to the count matrix followed by principal components analysis (PCA) or any other type, it is known that such approach can induce spurious heterogeneity and mask true biological variability. An alternative approach is to directly model the counts. Some methods perform dimensionality reduction directly using a probabilistic model of the count data matrix.These methods can avoid the artifactual biases of simple transformations and, further, can provide principled uncertainty quantification for downstream analyses and visualization. These include GLM-PCA, which models the entries of the count matrix using a Poisson or negative-binomial distribution and estimates latent factors in the log space. However GLM-PCA suffers from slow runtime and convergence issues on single-cell datasets with millions of cells.

scGBM is a novel method for model-based dimensionality reduction of scRNA-seq data using a Poisson bilinear model. You can access the preprint here. They introduce a fast estimation algorithm to fit the model using iteratively reweighted singular value decompositions. Furthermore, scGBM quantifies the uncertainty in each cell’s latent position and leverages these uncertainties to assess the confidence associated with a given cell clustering. Starting from the same underlying model as in GLM-PCA however with a new estimation algorithm that is faster than existing approaches.

if (!require("devtools", quietly = TRUE))
  install.packages("devtools")
library("devtools")
if (!require("scGBM", quietly = TRUE))
  devtools::install_github("phillipnicol/scGBM")
library("scGBM")

#https://github.com/phillipnicol/scGBM
set.seed(1126490984)
#Run scGBM with latent factors
out                  <- scGBM::gbm.sc(as.matrix(GE@assays$RNA$counts),M=10)# M is the number of latent factors
#out.proj             <- scGBM::gbm.sc(as.matrix(GE@assays$RNA$counts),M=2,subset=100,ncores=8) ; #
colnames(out$scores) <- 1:10
GE[["gbm"]] <- CreateDimReducObject(embeddings=out$scores,key="GBM_")

save(list=c("out","GE"),file="./results_scGBM.RDtata")
load("./results_scGBM.RDtata")
pt <- DimPlot(GE, reduction = "gbm", group.by = "treatment", pt.size=1, cols = mycols, dims = c(1, 2))+ggtitle("GBM")
pt

Clustering

Seurat applies a graph-based clustering approach, building upon initial strategies in (Macosko et al). First, a KNN graph based on the euclidean distance in PCA space in constructed, and the edge weights between any two cells are refined based on the shared overlap in their local neighborhoods (Jaccard similarity). This step is performed using the FindNeighbors() function, and takes as input the previously defined dimensionality of the dataset (first 10 PCs).

To cluster the cells, modularity optimization techniques such as the Louvain algorithm (default) or SLM [SLM, Blondel et al., Journal of Statistical Mechanics] are applied, to iteratively group cells together, with the goal of optimizing the standard modularity function. The FindClusters() function implements this procedure, and contains a resolution parameter that sets the ‘granularity’ of the downstream clustering, with increased values leading to a greater number of clusters. The authors of Seurat found that setting this parameter between 0.4-1.2 typically returns good results for single-cell datasets of around 3K cells. Optimal resolution often increases for larger datasets.

Seurat

GE_log <- FindNeighbors(GE_log, dims = 1:10)
GE_log <- FindClusters(GE_log, resolution = 0.08)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 6414
## Number of edges: 207228
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9602
## Number of communities: 3
## Elapsed time: 1 seconds
DimPlot(GE_log, reduction = "tsne")

It is now possible to extract all cells originating from one cluster:

# Look at the IDs of the cells in the cluster 8
cluster2 <- subset(GE_log, subset = seurat_clusters == 2)
colnames(cluster2)

scGMB

GE                   <- FindNeighbors(GE,reduction = "gbm")
GE                   <- FindClusters(GE)
DimPlot(GE, reduction = "gbm")

Differential expression analysis between clusters

Here the goal is to find markers that define clusters via differential expression (DE). You can either: - identify positive and negative markers of a single cluster compared to all other cells - test groups of clusters vs. each other, or against all cells.

# find markers for every cluster compared to all remaining cells, report only the positive ones
# This cell takes ~15-20minutes to run
GE_log.markers <- FindAllMarkers(GE_log, only.pos = TRUE)
GE_log.markers %>%
    group_by(cluster) %>%
    dplyr::filter(avg_log2FC > 1)
## # A tibble: 1,754 × 7
## # Groups:   cluster [3]
##    p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene   
##    <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>  
##  1     0       3.31 0.957 0.257         0 0       BCAS3  
##  2     0       4.78 0.608 0.021         0 0       MDK    
##  3     0       4.43 0.58  0.019         0 0       RAC2   
##  4     0       6.92 0.558 0.003         0 0       SLC17A9
##  5     0       3.42 0.618 0.096         0 0       UNC13D 
##  6     0       2.21 0.727 0.206         0 0       S100A1 
##  7     0       2.06 0.91  0.399         0 0       RAMP1  
##  8     0       2.16 0.787 0.323         0 0       DCBLD2 
##  9     0       2.30 0.955 0.494         0 0       CAPG   
## 10     0       2.11 0.806 0.349         0 0       KCNAB2 
## # ℹ 1,744 more rows
GE_log.markers %>%
    group_by(cluster) %>%
    dplyr::filter(avg_log2FC > 1) %>%
    slice_head(n = 20) %>%
    ungroup() -> top8
DoHeatmap(GE_log, features = top8$gene) + NoLegend()

We can perform gene enrichment analysis on the top 300 u-regulated genes in cluster n0.2:

# print the top marker genes for cluster 2
gs<-head(GE_log.markers[GE_log.markers$cluster == 2,], n=300)$gene

gostres_diff <- gost(query =gs, 
                organism = "hsapiens", ordered_query = FALSE, 
                multi_query = FALSE, significant = TRUE, exclude_iea = FALSE, 
                measure_underrepresentation = FALSE, evcodes = TRUE, 
                user_threshold = 0.05, correction_method = "g_SCS", 
                domain_scope = "annotated", custom_bg = NULL, 
                numeric_ns = "", as_short_link = FALSE,sources=c("GO:BP", "GO:MF","KEGG","REAC"))
gostplot(gostres_diff, capped = FALSE, interactive = TRUE)

Cell type identification

if (!require("HGNChelper", quietly = TRUE))
  install.packages("HGNChelper")
library("HGNChelper")
if (!require("openxlsx", quietly = TRUE))
  install.packages("openxlsx", dependencies = TRUE)
library("openxlsx")

Assignment

Task 1 (1pt)

In this task, use sctype for cell type annotation (see the documentation here: https://github.com/IanevskiAleksandr/sc-type). The expected output is a plot of the clusters with cell type annotation. You are free to use the dimensionality reduction technique that you want among the ones tested above; and to choose which tissue to test.

Comment: In the article Symptoms of advanced melanoma - Cancer Research UK, it is stated that melanoma can spread most offen spreads to: - lymph nodes - lungs - liver - bones - brain - abdomen

For this reason I have decided to consider the following tissues:

Immune system

#https://github.com/IanevskiAleksandr/sc-type

# Load ScType functions
# load gene set preparation function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/gene_sets_prepare.R")
# load cell type annotation function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/sctype_score_.R")
# load wrapper function 
source("https://raw.githubusercontent.com/kris-nader/sc-type/master/R/sctype_wrapper.R"); 

# By default, we use our in-built cell marker DB
db_ <- "https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/ScTypeDB_full.xlsx";

# provide a tissue type your data belongs to
tissue <- "Immune system" 
# e.g. Immune system, Pancreas, Liver, Eye, Kidney, Brain, Lung, Adrenal, Heart, Intestine, Muscle, Placenta, Spleen, Stomach, Thymus 
# prepare gene sets
gs_list <- gene_sets_prepare(db_, tissue)

# check Seurat object version (scRNA-seq matrix extracted differently in Seurat v4/v5)
seurat_package_v5 <- isFALSE('counts' %in% names(attributes(GE_log[["RNA"]])));
print(sprintf("Seurat object %s is used", ifelse(seurat_package_v5, "v5", "v4")))
## [1] "Seurat object v5 is used"
# extract scaled scRNA-seq matrix
scRNAseqData_scaled <- if (seurat_package_v5) as.matrix(GE_log[["RNA"]]$scale.data) else as.matrix(GE_log[["RNA"]]@scale.data)

# run ScType
es.max <- sctype_score(scRNAseqData = scRNAseqData_scaled, scaled = TRUE, gs = gs_list$gs_positive, gs2 = gs_list$gs_negative)

# merge by cluster
cL_resutls <- do.call("rbind", lapply(unique(GE_log@meta.data$seurat_clusters), function(cl){
    es.max.cl = sort(rowSums(es.max[ ,rownames(GE_log@meta.data[GE_log@meta.data$seurat_clusters==cl, ])]), decreasing = !0)
    head(data.frame(cluster = cl, type = names(es.max.cl), scores = es.max.cl, ncells = sum(GE_log@meta.data$seurat_clusters==cl)), 10)
}))
sctype_scores <- cL_resutls %>% group_by(cluster) %>% top_n(n = 1, wt = scores)  

# set low-confident (low ScType score) clusters to "unknown"
sctype_scores$type[as.numeric(as.character(sctype_scores$scores)) < sctype_scores$ncells/4] <- "Unknown"
print(sctype_scores[,1:3])
## # A tibble: 3 × 3
## # Groups:   cluster [3]
##   cluster type      scores
##   <fct>   <chr>      <dbl>
## 1 0       Platelets 891.  
## 2 2       Unknown     3.94
## 3 1       Unknown   342.
GE_log@meta.data$sctype_classification = ""
for(j in unique(sctype_scores$cluster)){
  cl_type = sctype_scores[sctype_scores$cluster==j,]; 
  GE_log@meta.data$sctype_classification[GE_log@meta.data$seurat_clusters == j] = as.character(cl_type$type[1])
}

DimPlot(GE_log, reduction = "umap", label = TRUE, repel = TRUE, group.by = 'sctype_classification')

DimPlot(GE_log, reduction = "tsne", label = TRUE, repel = TRUE, group.by = 'sctype_classification')

Using two different techniques for dimensionality reduction (UMPA and t-SNE), we can identify three clusters: one of platelets (with a score of 891.47) and two of unknown (one with score 3.94 and one with score 341.60).

Lung

#https://github.com/IanevskiAleksandr/sc-type

# Load ScType functions
# load gene set preparation function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/gene_sets_prepare.R")
# load cell type annotation function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/sctype_score_.R")
# load wrapper function 
source("https://raw.githubusercontent.com/kris-nader/sc-type/master/R/sctype_wrapper.R"); 

# By default, we use our in-built cell marker DB
db_ <- "https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/ScTypeDB_full.xlsx";

# provide a tissue type your data belongs to
tissue <- "Lung" 
# e.g. Immune system, Pancreas, Liver, Eye, Kidney, Brain, Lung, Adrenal, Heart, Intestine, Muscle, Placenta, Spleen, Stomach, Thymus 
# prepare gene sets
gs_list <- gene_sets_prepare(db_, tissue)

# check Seurat object version (scRNA-seq matrix extracted differently in Seurat v4/v5)
seurat_package_v5 <- isFALSE('counts' %in% names(attributes(GE_log[["RNA"]])));
print(sprintf("Seurat object %s is used", ifelse(seurat_package_v5, "v5", "v4")))
## [1] "Seurat object v5 is used"
# extract scaled scRNA-seq matrix
scRNAseqData_scaled <- if (seurat_package_v5) as.matrix(GE_log[["RNA"]]$scale.data) else as.matrix(GE_log[["RNA"]]@scale.data)

# run ScType
es.max <- sctype_score(scRNAseqData = scRNAseqData_scaled, scaled = TRUE, gs = gs_list$gs_positive, gs2 = gs_list$gs_negative)

# merge by cluster
cL_resutls <- do.call("rbind", lapply(unique(GE_log@meta.data$seurat_clusters), function(cl){
    es.max.cl = sort(rowSums(es.max[ ,rownames(GE_log@meta.data[GE_log@meta.data$seurat_clusters==cl, ])]), decreasing = !0)
    head(data.frame(cluster = cl, type = names(es.max.cl), scores = es.max.cl, ncells = sum(GE_log@meta.data$seurat_clusters==cl)), 10)
}))
sctype_scores <- cL_resutls %>% group_by(cluster) %>% top_n(n = 1, wt = scores)  

# set low-confident (low ScType score) clusters to "unknown"
sctype_scores$type[as.numeric(as.character(sctype_scores$scores)) < sctype_scores$ncells/4] <- "Unknown"
print(sctype_scores[,1:3])
## # A tibble: 3 × 3
## # Groups:   cluster [3]
##   cluster type                             scores
##   <fct>   <chr>                             <dbl>
## 1 0       Pulmonary alveolar type II cells 1115. 
## 2 2       Unknown                           -12.8
## 3 1       Unknown                           526.
GE_log@meta.data$sctype_classification = ""
for(j in unique(sctype_scores$cluster)){
  cl_type = sctype_scores[sctype_scores$cluster==j,]; 
  GE_log@meta.data$sctype_classification[GE_log@meta.data$seurat_clusters == j] = as.character(cl_type$type[1])
}

DimPlot(GE_log, reduction = "umap", label = TRUE, repel = TRUE, group.by = 'sctype_classification')

DimPlot(GE_log, reduction = "tsne", label = TRUE, repel = TRUE, group.by = 'sctype_classification')

Also in this case, where we are considering lung instead of immune system, we find three clusters: pulmonary alveolar type II cells (score: 1114.69), unknown (score: -12.82) and unknown (score: 526.42).

Liver

#https://github.com/IanevskiAleksandr/sc-type

# Load ScType functions
# load gene set preparation function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/gene_sets_prepare.R")
# load cell type annotation function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/sctype_score_.R")
# load wrapper function 
source("https://raw.githubusercontent.com/kris-nader/sc-type/master/R/sctype_wrapper.R"); 

# By default, we use our in-built cell marker DB
db_ <- "https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/ScTypeDB_full.xlsx";

# provide a tissue type your data belongs to
tissue <- "Liver" 
# e.g. Immune system, Pancreas, Liver, Eye, Kidney, Brain, Lung, Adrenal, Heart, Intestine, Muscle, Placenta, Spleen, Stomach, Thymus 
# prepare gene sets
gs_list <- gene_sets_prepare(db_, tissue)

# check Seurat object version (scRNA-seq matrix extracted differently in Seurat v4/v5)
seurat_package_v5 <- isFALSE('counts' %in% names(attributes(GE_log[["RNA"]])));
print(sprintf("Seurat object %s is used", ifelse(seurat_package_v5, "v5", "v4")))
## [1] "Seurat object v5 is used"
# extract scaled scRNA-seq matrix
scRNAseqData_scaled <- if (seurat_package_v5) as.matrix(GE_log[["RNA"]]$scale.data) else as.matrix(GE_log[["RNA"]]@scale.data)

# run ScType
es.max <- sctype_score(scRNAseqData = scRNAseqData_scaled, scaled = TRUE, gs = gs_list$gs_positive, gs2 = gs_list$gs_negative)

# merge by cluster
cL_resutls <- do.call("rbind", lapply(unique(GE_log@meta.data$seurat_clusters), function(cl){
    es.max.cl = sort(rowSums(es.max[ ,rownames(GE_log@meta.data[GE_log@meta.data$seurat_clusters==cl, ])]), decreasing = !0)
    head(data.frame(cluster = cl, type = names(es.max.cl), scores = es.max.cl, ncells = sum(GE_log@meta.data$seurat_clusters==cl)), 10)
}))
sctype_scores <- cL_resutls %>% group_by(cluster) %>% top_n(n = 1, wt = scores)  

# set low-confident (low ScType score) clusters to "unknown"
sctype_scores$type[as.numeric(as.character(sctype_scores$scores)) < sctype_scores$ncells/4] <- "Unknown"
print(sctype_scores[,1:3])
## # A tibble: 3 × 3
## # Groups:   cluster [3]
##   cluster type        scores
##   <fct>   <chr>        <dbl>
## 1 0       Hepatocytes 1221. 
## 2 2       Unknown      -89.3
## 3 1       Unknown      279.
GE_log@meta.data$sctype_classification = ""
for(j in unique(sctype_scores$cluster)){
  cl_type = sctype_scores[sctype_scores$cluster==j,]; 
  GE_log@meta.data$sctype_classification[GE_log@meta.data$seurat_clusters == j] = as.character(cl_type$type[1])
}

DimPlot(GE_log, reduction = "umap", label = TRUE, repel = TRUE, group.by = 'sctype_classification')

DimPlot(GE_log, reduction = "tsne", label = TRUE, repel = TRUE, group.by = 'sctype_classification')

If we consider liver, we also find three clusters: hepatocytes (score: 1221.15), unknown (score: -89.26) and unknown (score: 278.79).

Brain

#https://github.com/IanevskiAleksandr/sc-type

# Load ScType functions
# load gene set preparation function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/gene_sets_prepare.R")
# load cell type annotation function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/sctype_score_.R")
# load wrapper function 
source("https://raw.githubusercontent.com/kris-nader/sc-type/master/R/sctype_wrapper.R"); 

# By default, we use our in-built cell marker DB
db_ <- "https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/ScTypeDB_full.xlsx";

# provide a tissue type your data belongs to
tissue <- "Brain" 
# e.g. Immune system, Pancreas, Liver, Eye, Kidney, Brain, Lung, Adrenal, Heart, Intestine, Muscle, Placenta, Spleen, Stomach, Thymus 
# prepare gene sets
gs_list <- gene_sets_prepare(db_, tissue)

# check Seurat object version (scRNA-seq matrix extracted differently in Seurat v4/v5)
seurat_package_v5 <- isFALSE('counts' %in% names(attributes(GE_log[["RNA"]])));
print(sprintf("Seurat object %s is used", ifelse(seurat_package_v5, "v5", "v4")))
## [1] "Seurat object v5 is used"
# extract scaled scRNA-seq matrix
scRNAseqData_scaled <- if (seurat_package_v5) as.matrix(GE_log[["RNA"]]$scale.data) else as.matrix(GE_log[["RNA"]]@scale.data)

# run ScType
es.max <- sctype_score(scRNAseqData = scRNAseqData_scaled, scaled = TRUE, gs = gs_list$gs_positive, gs2 = gs_list$gs_negative)

# merge by cluster
cL_resutls <- do.call("rbind", lapply(unique(GE_log@meta.data$seurat_clusters), function(cl){
    es.max.cl = sort(rowSums(es.max[ ,rownames(GE_log@meta.data[GE_log@meta.data$seurat_clusters==cl, ])]), decreasing = !0)
    head(data.frame(cluster = cl, type = names(es.max.cl), scores = es.max.cl, ncells = sum(GE_log@meta.data$seurat_clusters==cl)), 10)
}))
sctype_scores <- cL_resutls %>% group_by(cluster) %>% top_n(n = 1, wt = scores)  

# set low-confident (low ScType score) clusters to "unknown"
sctype_scores$type[as.numeric(as.character(sctype_scores$scores)) < sctype_scores$ncells/4] <- "Unknown"
print(sctype_scores[,1:3])
## # A tibble: 3 × 3
## # Groups:   cluster [3]
##   cluster type                    scores
##   <fct>   <chr>                    <dbl>
## 1 0       Unknown                  723. 
## 2 2       Unknown                  -14.3
## 3 1       Schwann precursor cells 1411.
GE_log@meta.data$sctype_classification = ""
for(j in unique(sctype_scores$cluster)){
  cl_type = sctype_scores[sctype_scores$cluster==j,]; 
  GE_log@meta.data$sctype_classification[GE_log@meta.data$seurat_clusters == j] = as.character(cl_type$type[1])
}

DimPlot(GE_log, reduction = "umap", label = TRUE, repel = TRUE, group.by = 'sctype_classification')

DimPlot(GE_log, reduction = "tsne", label = TRUE, repel = TRUE, group.by = 'sctype_classification')

Also considering brain, we find three clusters: unknown (score: 722.99), unknown (score: -14.33) and schwann precursor cells (score: 1411.39).

Task 2 - BONUS (0.25 pt)

Test another cell type annotation method. You can choose among the methods below (scGPT, deCS and scAnnotate) or propose an alternative method.

Using deCS

if (!require("deCS", quietly = TRUE))
  BiocManager::install("deCS")
library(deCS)

# MonacoImmune_main_t_score
data(MonacoImmune_main)
#https://github.com/bsml320/deCS
#https://academic.oup.com/gpb/article/21/2/370/7585489?login=false
top10 <- GE_log.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)
GE_log_top10_markers_list = GE_log.markers[which(GE_log.markers$gene %in% top10$gene),] 
GE_log_cluster_average = log(AverageExpression(GE_log)[[1]] + 1, 2)
GE_log_cluster_marker_average = GE_log_cluster_average[which(rownames(GE_log_cluster_average) %in% top10$gene), ]
GE_log_cluster_marker_z_score = t(scale(t(GE_log_cluster_marker_average)))
GE_log_deCS_cor_panel_A <- deCS.correlation(GE_log_cluster_marker_z_score, MonacoImmune_main_t_score)

Using deCS, instead of sctype, we find three different clusters (‘g0’, ‘g1’ and ‘g2’). For the cluster ‘g0’, the predominant component is progenitors, while for ‘g2’ it is denditric cells.